Information-incorporated sparse convex clustering for disease subtyping

Abstract Motivation Heterogeneity in human diseases presents clinical challenges in accurate disease characterization and treatment. Recently available high throughput multi-omics data may offer a great opportunity to explore the underlying mechanisms of diseases and improve disease heterogeneity assessment throughout the treatment course. In addition, increasingly accumulated data from existing literature may be informative about disease subtyping. However, the existing clustering procedures, such as Sparse Convex Clustering (SCC), cannot directly utilize the prior information even though SCC produces stable clusters. Results We develop a clustering procedure, information-incorporated Sparse Convex Clustering, to respond to the need for disease subtyping in precision medicine. Utilizing the text mining approach, the proposed method leverages the existing information from previously published studies through a group lasso penalty to improve disease subtyping and biomarker identification. The proposed method allows taking heterogeneous information, such as multi-omics data. We conduct simulation studies under several scenarios with various accuracy of the prior information to evaluate the performance of our method. The proposed method outperforms other clustering methods, such as SCC, K-means, Sparse K-means, iCluster+, and Bayesian Consensus Clustering. In addition, the proposed method generates more accurate disease subtypes and identifies important biomarkers for future studies in real data analysis of breast and lung cancer-related omics data. In conclusion, we present an information-incorporated clustering procedure that allows coherent pattern discovery and feature selection. Availability and implementation The code is available upon request.


Introduction
Disease heterogeneity is common in many complex diseases, such as cancer. As a result, patients with a shared diagnosis may have different clinical responses to the same treatment. Such heterogeneity could be due to disease subtypes with distinct physiological processes. Therefore, accurate identification of these disease subtypes becomes critical for precision medicine. Breast cancer, e.g. four molecular subtypes (i.e. ERþ/luminal-like, basal-like, HER2-enriched, and normallike) with distinct clinical responses to treatments have been identified through gene expression data (Perou et al. 2000, van 't Veer et al. 2002. The Cancer Genome Atlas Networks (TCGA) group further demonstrated these four subtypes of breast cancer via multi-omics datasets, including genomic DNA copy number arrays, DNA methylation, exome sequencing, messenger RNA arrays, and microRNA sequencing (Cancer Genome Atlas Network 2012). Using omics datasets to identify disease subtypes has been extensively studied in many complex diseases beyond breast cancer, such as leukemia (Golub et al. 1999, Bullinger et al. 2004, lymphoma (Alizadeh et al. 2000, Rosenwald et al. 2002, colorectal cancer (Mo et al. 2013, Sadanandam et al. 2013, and Alzheimer's disease (Bredesen 2015, Meng et al. 2022. Clustering, an unsupervised machine learning method, has been widely applied to the disease subtyping problem (Zhang et al. 2022). However, most popular clustering algorithms, such as K-means, hierarchical clustering, and model-based clustering, suffer from instability (i.e. a different initialization or a slight change in data may yield different results) due to their nonconvex objective functions. Convex clustering thus has been proposed to address this issue (Pelckmans et al. 2005, Hocking et al. 2011, Lindsten et al. 2011, Chi and Lange 2015. In addition, convex clustering enjoys an appealing property obtaining a globally optimal solution. Recently, sparse convex clustering (SCC) was developed as an extension of convex clustering to efficiently deal with large datasets and select driving features simultaneously (Wang et al. 2018). This newly acquired property is essential and much needed when analyzing high-throughput omics datasets for disease subtyping.
Nevertheless, other noisy patterns within omics datasets may compromise the power to identify clinically meaningful subtypes. For example, in genetics studies, clusters could be driven by sex-related genes, age-related genes, or other unknown confounders-related genes instead of disease-related genes that interest us (Meng et al. 2022). Here, noisy clusters refer to those driven by nondisease-related biomarkers, as illustrated in Supplementary Fig. S1. Supplementary Figure S1 is a heatmap for a simulated data matrix with n ¼ 60 samples by p ¼ 100 features. We aim to find coherent sample clusters associated with the disease of interest. Supplementary Figure  S1 has two types of clusters: one defined by features 1-30 and the other by features 31-50. The signal strength of the first type is slightly stronger than the second type. If the disease of interest is only associated with the first type of clusters, then identifying them would be challenging since most existing clustering methods would be compromised by the second type of clusters (i.e. noisy clusters) here. Similar scenarios have also been discussed by Gaynor and Bair (2017) and Nowak and Tibshirani (2008).
Incorporating clinical information can facilitate the identification of disease-related subtypes Tibshirani 2004, Meng et al. 2022). Previous studies have shown that the incorporation of prior information can improve model performance in various areas, such as variable selection for generalized linear models (Jiang et al. 2016), identifying gene-environment interactions (Wang et al. 2019), and network analysis of gene expression data (Wu et al. 2023). However, direct clinical information for samples may not always be available, and we may alternatively leverage historical data, such as published studies, to help identify disease-related subtypes. One common approach to extracting information from published studies is to mine biomedical literature from publicly available databases, such as PubMed, Embase, and the Cochrane Library. There are several benefits of doing that. First, these databases comprise a growing number of mostly independent studies. The information is regularly updated and relatively reliable. Second, it is more cost-effective than collecting additional data by the researchers themselves. Therefore, we propose a novel clustering procedure incorporating prior information via text mining on PubMed.
In this article, we extend sparse convex clustering (Wang et al. 2018) and propose an information-incorporated sparse convex clustering (iSCC) procedure to identify disease-related subtypes and their driven features using multi-omics datasets. The rest of the article is structured as follows. We first present the proposed method and the implemented algorithm. We then conduct simulation studies and real data applications to compare the performance of our proposed method with other alternative methods. Finally, we conclude this article with a general discussion.

Materials and methods
In this section, we will present our proposed procedure iSCC in detail. The proposed method consists of two steps: information retrieval and information-incorporated analysis. We will then discuss some practical considerations when implementing the proposed method.

iSCC
iSCC procedure mainly includes two steps.
Step I: information retrieval aims to find disease-related features by leveraging existing publications through text mining techniques and assigning appropriate feature weights for our objective function.
Step II: information-incorporated analysis is the sparse convex clustering procedure based on the updated weights in Step I and transformed omics datasets.
Step I: Information retrieval. To identify disease-related features, we construct a search for the co-occurrence of each feature and disease of interest. Take PubMed as an example. We can specify a keyword search of "(PGR[Title/Abstract]) AND (breast cancer[Title/Abstract])" in the search field. This search allows us to count how frequently progesterone receptor (PGR) is mentioned together with breast cancer in the same study. Further refinement of the search field is available based on the purpose of the study. More detailed information is available at https://pubmed.ncbi.nlm.nih.gov/help/#searchtags. For illustration, we focus on publications with a pair of a feature and a disease in their titles or abstracts.
Denote f j as the co-occurrence of feature j and a disease of interest, and F as the set of co-occurrences of all feature-disease pairs in a dataset such that f j 2 F. We assume that for a given feature-disease pair, the more publications got identified, the more likely the feature is to be an informative feature related to the mechanism of the disease. Since we aim to select informative features, we enforce different weights based on their co-occurrence with the disease of interest, putting less penalty on more informative features. Specifically, we define the weight z j for the feature j as: where f T is a user-specified threshold of the co-occurrence of a feature-disease pair in the set F . Researchers can define f T based on their knowledge. We suggest using f T ¼ f 0:9 , the 90th percentile of the co-occurrence in the set F since the co-occurrence distribution is highly skewed and only a few features might be informative. Text mining is a fast-growing research area, and many text mining tools can serve our purpose here, such as MedMiner, PubMatrix, and easyPubMed. To illustrate our proposed procedure, we use easyPubMed to mine publications deposited in PubMed in this study. Compared to web-based text mining tools, easyPubMed as an R package does not require maintenance from the developers. Thus, it is more convenient to implement.
Step II: Information-incorporated analysis. Before integrating multiple omics datasets, we need a standardization procedure. We perform a feature-level centering and scaling each omics dataset by its standard deviation. Then we concatenate all transformed omics datasets into one big matrix X 2 R nÂp as our input data, where n is the number of samples and p is the total number of features. Let X iÁ ; i ¼ 1; . . . ; n, be the ith sample in the concatenated dataset and rewrite X in a featurelevel way, X ¼ x 1 ; . . . ; x p ð Þ where x j is a column vector for the jth feature. Similar to sparse convex clustering (Wang et al. 2018), we can express the objective function of iSCC with the updated feature weight z j from step I as: where U is the cluster center attached to the concatenated matrix X, u j is the cluster center for the jth feature, U i1Á is the cluster center for the i 1 th sample, Þ: 1 i 1 < i 2 ng, and v l ¼ U i1Á À U i2Á represents the difference between two cluster centers. Tuning parameters c 1 and c 2 control the number of clusters and the number of informative features, respectively. The non-negative sample Zhang and Liu and 0 otherwise, / ¼ 0:5, and Á j j j j 2 is the L 2 -norm. The second term of (2) is related to the fused lasso penalty (Tibshirani et al. 2005), which encourages samples' centers to be fused with similar rows. The feature-weight z j in the third group-lasso term of (2) plays an important role in selecting features. We implement the alternating minimization algorithm (AMA) (Tseng 1991) to solve our objective function, which is computationally efficient.

Practical considerations
When applying iSCC, the sample weight w i1;i2 and the feature weight z j are user-defined inputs, while c 1 and c 2 are tuning parameters. We specify feature weight z j through a text mining approach, as illustrated in Step I. For sample weight w i 1 ;i 2 , we can update the sample weight as w i1;i2 ¼ i d to emphasize the importance of informative features when calculating sample-wise distance. Two tuning parameters are included in the objective function of iSCC: c 1 controls the number of clusters; c 2 controls the number of selected features. We choose the information-based approach (Tan andWitten 2015, Wang andAllen 2021) to select tunings instead of the stability-based approach (Wang 2010, Fang and, which is often computationally intensive. When the numbers of clusters and informative features are known, we can first fix c 2 ¼ 1 and fit iSCC with a sequence of c 1 to get the c 1 that gives the closest number of clusters to the known number. We then fit iSCC with the chosen c 1 and a sequence of c 2 to get the c 2 that gives the closest number of features. We repeat this procedure until c 1 and c 2 stabilized. When the numbers of clusters and features are unknown, we adopt the Bayesian information criterion (BIC) based approach to choose c 1 and c 2 (Tan and Witten 2015, Wang and Allen 2021). We first fit iSCC with a fixed c 2 ¼ 1 and a sequence of c 1 . Then we choose the c 1 that minimizes BIC cluster ¼ RSSact pact þ n clust Â log n Â p act ð Þ , where RSS act ¼ X act ÀÛ act 2 2 ; p act ¼ 1 n kX act À X act k 2 2 ; act represents the dataset including selected features only, n clust is the number of identified clusters, and p act is the number of selected features. We then fit iSCC with the chosen c 1 and a sequence of c 2 to get the c 2 that gives the minimum of Then repeat the procedure until c 1 and c 2 stabilized. When either the number of clusters or the number of features is unknown, the tuning procedure is similar to the above, except that choosing c 1 that minimizes BIC cluster and c 2 that gives the closest number of features or choosing c 1 that gives the closest number of clusters and c 2 that minimizes BIC feature . In summary, BIC cluster is used to select the number of clusters based on selected features with a degree of freedom of n clust . BIC feature is then used to select the number of features with the estimated number of clusters from the previous step and the degree of freedom of this step is n clust Â p act .

Simulation studies
We conduct simulation studies to evaluate the performance of our proposed method, iSCC, in comparison with SCC, Kmeans, Sparse K-means (Sun et al. 2012, Witten andTibshirani 2010), iClusterþ (Mo et al. 2013), and Bayesian Consensus Clustering (BCC) (Lock and Dunson 2013) with available R packages (scvxlustr, sparcl, iClusterPlus) and code. We implement the adjusted Rand index (ARI) (Hubert and Arabie 1985) to measure the accuracy of clustering results. The ARI calculates the similarity between the estimated clustering assignment and the actual group label. It ranges from -1 to 1, and a larger value implies a better clustering result. We report the false negative rate (FNR), the false positive rate (FPR), and the Matthews correlation coefficient (MCC) (Matthews 1975) to evaluate the performance of feature selection for each method. We assume that the number of clusters and features are known for fair comparison between methods. S1 spherical setting: The numbers of features are p 1 ¼ 100 and p 2 ¼ 200 for two datasets X 1 and X 2 , respectively. For X 1 , the first 30 informative features are generated from a multivariate normal distribution MVNðl k ; r 2 Â I 30 Þ where ; 1 10 Þ T , and r ¼ 0:5. The fourth cluster is from the background distribution [i.e. all features are from Nð0; r 2 Þ]. In addition, we generate a noisy clustering pattern using features 31-50 The remaining noninformative features are generated from Nð0; r 2 Þ. Similar setups are applied to X 2 except The number of features is p 1 ¼ 200 and p 2 ¼ 500 for two datasets X 1 and X 2 , respectively. Like S1, but the means of features are in different directions. The first 30 informative features of X 1 have l 10 Þ T : S3 nonspherical setting with two half-moons: The number of features is p 1 ¼ 100 and p 2 ¼ 200 for two datasets X 1 and X 2 , respectively. Each pair of the first ten informative features in either X 1 or X 2 makes up two interlocking half-moons through a combination of cosine and sine functions. Again, a noisy clustering pattern is generated using the second 10 features (noisy features) from MVNðl 0 k ; I 10 Þ where l 0 The remaining noninformative features are generated from Nð0; 1Þ. Supplementary Figure S2 reveals one example of two interlocking half-moons using one pair of informative features.
For each setting, we evaluate the effect of the accuracy rate of informative feature weights (h i ¼ 1; 0:7; 0:5; 0:3) and the accuracy rate of noninformative feature weights (h ni ¼ 0:9; 0:8; 0:7) individually while assuming a perfect accuracy rate for another feature set. Then we also consider their joint effects with varying accuracy rates (h both ¼ 0:9; 0:8; 0:7) simultaneously. More specifically, h both is defined as the proportion of both informative and noninformative features with correct weights. All the results are based on 100 replicates.

Simulation results
The simulation results are summarized in Table 1. The proposed method, iSCC, performs the best in terms of clustering iSCC accuracy with the highest ARI and feature selection with the highest MCC in all three settings when proper weights are assigned to both informative and noninformative feature based on prior information (h i ¼ 1). The means of ARI are 0.91, 0.92, and 0.91, and the means of MCC are 0.96, 0.96, and 1.00 in three settings for iSCC when h i ¼ 1, respectively. In addition, iSCC outperforms most other methods, including SCC, K-means, iClusterþ, and BCC, for both clustering accuracy and feature selection in all ten scenarios with different accuracy rates of feature weights in both S1 and S2. In S1, the lowest means of ARI (0.88) and MCC (0.59) among all iSCC occur with h both ¼ 0:7, which is still higher than most compared methods except for Sparse K-means, which has ARI ¼ 0.89 and MCC ¼ 0.67. Similarly, in S2, iSCC with h both ¼ 0:7 has the lowest ARI (0.78) and MCC (0.67) among all iSCC, which are still higher than most compared methods except for Sparse K-means with ARI ¼ 0.87 and MCC ¼ 0.71. When h i > 0:5, or h both > 0:8, iSCC performs better than the best performer, Sparse K-means, among compared methods regarding clustering accuracy in S3. In terms of feature selection, iSCC with h i > 0:5 or h ni > 0:8 performs better or comparable with the best performer, iClusterþ, among compared methods in S3. The clustering accuracy and the performance of feature selection for iSCC generally decrease when the Table 1. Simulation results are based on 100 replicates, including mean and standard deviation (SD) of the adjusted Rand index (ARI), false negative rate (FNR), false positive rate (FPR), and Matthews correlation coefficient (MCC a h i ; h ni ; h both are the accuracy rates of informative feature weights, noninformative feature weights, and both informative and noninformative feature weights, respectively. Three simulation settings are considered. S1 spherical setting: K ¼ 4; n ¼ 80; p 1 ¼ 100; p 2 ¼ 200; S2 spherical setting: K ¼ 4; n ¼ 80; p 1 ¼ 200; p 2 ¼ 500; S3 nonspherical setting with two half-moons: K ¼ 2; n ¼ 80; p 1 ¼ 100; p 2 ¼ 200. All clusters within each setting have an equal sample size. The best ARI and MCC in each setting are in bold. b Since K-means and BCC cannot select featues, there is no FNR, FPR, or MCC for them. N.A. stands for not applicable.
accuracy rate of feature weights decreases. We only compare feature selection among iSCC, SCC, Sparse K-means, and iClusterþ since K-means and BCC cannot select features. The simulation results when the clusters with imbalanced sample size are summarized in Supplementary Table S1. iSCC outperforms all other methods, including SCC, K-means, Sparse K-means, iClusterþ, and BCC, for clustering accuracy in all ten scenarios with different accuracy rates of feature weights in both S1 and S2. In terms of feature selection, iSCC with h i > 0:5; h ni ! 0:7 or h both > 0:7 performs better than the best performer, SCC, among compared methods in S1 and S2. In S3, iSCC outperforms other methods when h i > 0:7; h ni > 0:8 or h both > 0:7 regarding clustering accuracy and performs better than other methods when h i > 0:3 regarding feature selection.

Real data applications
In this section, we apply our method, iSCC, to several multiomics data from the Cancer Genome Atlas (TCGA) breast cancer and mRNA expression data from human lung cancer. We evaluate the performance of our method against other clustering methods, including SCC, K-means, Sparse Kmeans, iClusterþ, and BCC. We use the above-mentioned BIC-based approach to tune hyperparameters for iSCC and SCC. The gap statistic (Tibshirani et al. 2001) is implemented to select the number of clusters for K-means, Sparse K-means, iClusterþ, and BCC. In addition, we perform follow-up analysis on clinical variables and investigate the biological interpretation for the identified clusters and features.

TCGA breast cancer multi-omics data
Breast cancer is a well-known heterogeneous disease that has been extensively studied. The TCGA, a landmark cancer genomics program, compiles a large amount of genomic, epigenomic, transcriptomic, and proteomic data spanning 33 different cancer types for public use. We utilize the TCGA breast cancer data (Cancer Genome Atlas Network 2012) and preprocess the data as suggested by Lock and Dunson (Lock and Dunson 2013). In total, we have 348 tumor samples with RNA gene expression data for 645 genes, DNA methylation data for 574 probes, miRNA expression data for 423 miRNAs, and reverse phase protein array data for 171 proteins. PAM50 is a gene-based classifier that classifies breast cancer into five molecular intrinsic subtypes: Luminal A, Luminal B, HER2-enriched, Basal-like, and Normal-like with distinct biological properties and prognoses using 50 genes (Parker et al. 2009). Since there are no true labels, we use the PAM50 as a reference to see if our identified clusters are biologically meaningful. We merge subtypes luminal A and luminal B into one luminal-like subtype since they are often regarded as one big group (Choi et al. 2014, Wang andAllen 2021). Feature-wise centering and data-wise scaling are performed before applying our method. The distributions of data from different platforms before and after transformation are shown in Supplementary Fig. S3a and b, respectively. All datasets appear to be symmetric after transformation. Through easyPubMed, the frequency of co-occurrence between a feature and breast cancer on PubMed ranges from 0 to 4201 for gene expression data; 0 to 1 for methylation data; 0 to 450 for miRNA data; and 0 to 22 053 for protein data. Supplementary Figure S4 shows the histograms of co-occurrence for data from different platforms for TCGA breast cancer.
Our method outperforms other methods with an ARI of 0.8 compared to alternative approaches with an ARI ranging from 0.32 to 0.77 (Table 2). We present the matching matrix comparing iSCC-identified clusters after removing clusters with less than three samples with the PAM50 subtypes in Table 3. Table 3 also shows different proportions of samples with the expression of clinical variables and 5-year survival rates for the iSCC clusters. Cluster 1 from iSCC mainly comprises subjects belonging to the luminal-like subtype, characterized by the expression of estrogen receptor (ER) and/or progesterone receptor (PR). We observe consistent patterns that more than 80% of patients in iSCC cluster 1 have ERþ and/or PRþ. iSCC cluster 2 is well aligned with the basal-like subtype, also known as ER-, PR-, HER2-triple-negative breast cancer, with a poor prognosis. Cluster 2 has the lowest five-year survival rate among all other identified clusters. iSCC clusters 3, 4, and 5 are closely related to the HER2enriched subtype. More than half of patients in these clusters have HER2 expression. In addition, iSCC reports 31 selected features (Supplementary Table S2). At least one breast cancerrelated publication has reported each of the selected features, even though only 29 selected features were originally receiving informative feature weights. PGR and ESR1, coding genes for PR and ER respectively, in the gene expression data and PR in the protein data, are selected as expected. More than 1000 publications mention them together with breast cancer. Another selected gene, FOXA1, is a hallmark of ERþ breast cancer and influences therapeutic responses in breast cancer (Arruabarrena-Aristorena et al. 2020). Several well-studied miRNAs have also been identified. For example, miR-155 has been closely related to breast cancer progression and drug resistance (Mattiske et al. 2012, Yu et al. 2015. miR-205 is differentially expressed in the different subtypes of breast cancer (Plantamura et al. 2020). Figure 1 shows the heatmap of TCGA multi-omics data for breast cancer together with the iSCC-generated clustering assignments and PAM50 labels. iSCC clusters are closely related to PAM50 subtypes with different patterns across four platforms of omics data.

Human lung cancer mRNA data
Lung cancer subtyping according to a molecular basis could help predict a patient's treatment outcome and develop targeted therapy. We use the mRNA gene expression data based on a cohort of people with or without lung cancer (Bhattacharjee et al. 2001). We select the top 500 out of 12 625 genes with the largest expression variance for 56 samples. There are four groups, including pulmonary carcinoid (Carcinoid: n ¼ 20), colon cancer metastasis (Colon: n ¼ 13), The data distribution is available in Supplementary Fig. S5, which is close to symmetric. Since the gene expression was assayed using the Affymetrix 95av2 GeneChip brand oligonucleotide array. We first convert the array probe name into gene name using https://biit.cs.ut.ee/gprofiler/convert. Then through easyPubMed, the co-occurrence between a gene and lung cancer on PubMed ranges from 0 to 505. Supplementary Figure S6 shows the histogram of co-occurrence for mRNA data for human lung cancer.
Our method maintains the highest ARI score of 0.78 compared to others (Table 2). Supplementary Table S2 presents iSCC-selected features, and Table 4 tabulates the matching matrix between iSCC-identified clusters (after removing clusters with less than three samples) and four known groups. Clusters 1 and 2 from iSCC are samples from the Carcinoid group. They could be two subtypes of Carcinoid. Means of mRNA expression and P-values of t-test between iSCCidentified clusters 1 and 2 for the 18 selected genes are shown in Supplementary Table S3. TTR, ASCL1, ALDH1A1, and NKX2-1 are significantly differentially expressed between clusters 1 and 2 after Bonferroni adjustment for multiple testing (significant threshold ¼ 0.0028). Many literatures have reported that these differentially expressed genes are related to the mechanism and subtypes of lung cancer (Roudi et al. 2015, Hao et al. 2016, Lee et al. 2019, Baine et al. 2020, Pozo et al. 2021. iSCC clusters 4 and 5 match the Colon and the Normal groups separately. iSCC cluster 3 is closely related to the Small cell group. Among the 18 features selected by iSCC (Supplementary Table S2), most have been reported in at least one lung cancer-related publication. MUC1, for example, is associated with a poor prognosis of lung cancer (Lakshmanan et al. 2015). Pro-Gastrin-Releasing-Peptide (Pro-GRP), a precursor of Gastrin-Releasing-Peptide coded by gene GRP, has been used as a tumor marker for small-cell lung cancer (Cavalieri et al. 2018). IL6 has recently been demonstrated to promote metastasis of non-small cell lung cancer (Liu et al. 2020). Figure 2 is the heatmap of gene expression data for lung cancer. As described above, iSCC clusters are well aligned with known groups, with a clear separation between clusters.

Discussion
The clustering analysis for disease subtyping faces two critical challenges: a lack of ways to utilize existing information and incorrectly identifying disease-related clusters. We develop an information-incorporated sparse convex clustering procedure, iSCC, for disease subtyping and biomarker identification. Simulation studies show that the proposed method outperforms other clustering methods when the accuracy rate of feature weights is relatively high. In addition, the real data a ERþ means breast cancers with estrogen receptors; PRþ means breast cancer with progesterone receptors; HER2þ means breast cancers with higherthan-normal levels of HER2 protein; 5-year survival (se) means the 5-year survival probabilities using the Kaplan-Meier estimator with standard error.    Zhang and Liu analysis of omics data demonstrates that iSCC can identify more accurate disease subtypes and select biologically meaningful biomarkers. We incorporate prior information from published studies through text mining techniques to enhance the identification of clinically meaningful disease subtypes and disease-related biomarkers. In our study, the text mining procedure only includes omics data for demonstration purposes. However, the method itself can be applied to other types of features, clinical or nonclinical, as long as there are suitable data resources for text mining. We take a binary weighting scheme for different features based on their co-occurrence with the disease of interest during the information retrieval procedure. This weighting approach will be less affected by the large variation of f j and the publication bias through time that some omics data may have much more publications than others. Also, this weighting approach is easy to implement and more efficient for tuning. However, more complex feature weights, such as continuous weights, can be designed if needed. Continuous weighting schemes may give more information about each feature, but their tuning procedure could be more challenging. In addition, we observe highly skewed cooccurrence distribution, and we assume that only a few features might be informative as more cautious with text mining results. We understand that the searched results may contain noisy information when counting the feature frequency due to the nature of text mining. Therefore, we suggest a binary weighting using the 90th percentile of the co-occurrence of feature-disease pairs as the threshold to assign feature weights. Other thresholds can be applied if researchers observe different shapes of the co-occurrence distribution or believe that more features be informative based on prior knowledge.
The proposed method can offer alternative settings beyond the weighting scheme. In the information-incorporated analysis procedure, the second term of equation (2) encourages similar samples' centers to be fused. For example, instead of the L 2 -norm, researchers may also consider L 1 -or L 1 -norm, which results in a solution containing many zeros, forming a small set of clusters (Pelckmans et al. 2005). L 1 -norm has an explicit solution as L 2 -norm does, but L 1 -norm does not and requires an efficient algorithm to solve it (Chi and Lange 2015). Compared to L 1 -or L 1 -norm, the use of L 2 -norm is more computationally efficient (Pelckmans et al. 2005). Another consideration is parameter tuning, as selecting tuning parameters is not trivial. We choose the information-based approach with time expense in mind since computational efficiency is critical in multi-omics data analysis. However, other options, such as the stability-based approach (Wang 2010, Fang and, can be applied if computational cost is not of concern since it is computationally intensive due to the bootstrapping procedure.
While we have demonstrated the proposed method's superior performance compared to the alternative approach, several directions are still worth exploring in the future. First, the publications resulting from the information retrieval may not always be relevant to disease subtypes. Since text mining is rapidly evolving, more advanced tools are developing, potentially improving to retrieval of more robust and relevant information from unstructured data. Secondly, the current design of the proposed procedure may not benefit less studied diseases with few publications. Researchers may follow the same spirit as the proposed method but consider other data sources or prior knowledge based on experts for these diseases. Therefore, how to efficiently incorporate various types of information requires further investigation. Thirdly, our simulation studies only evaluate the impact of the spherical shape on noisy clustering patterns among spherical and nonspherical settings. It would be valuable to consider further the effect of different shapes of noisy clustering patterns under different setups for future work.
In conclusion, we introduce a clustering approach allowing prior information incorporation to leverage the existing and accumulated data resources. The proposed method responds to the need for disease subtyping and biomarker identification in precision medicine.

Supplementary data
Supplementary data are available at Bioinformatics online.